GH-3487: Fix PageReadStore buffer leak in ParquetFileReader #3488

Open
arouel wants to merge 1 commit into apache:master from arouel:close-page-read-store

Conversation


@arouel arouel commented Apr 18, 2026

Rationale for this change

ParquetFileReader never closes the ColumnChunkPageReadStore it returns from readNextRowGroup(). When a subsequent call replaces currentRowGroup, the previous instance's ByteBufferReleaser is abandoned without releasing the compressed I/O buffers and any off-heap decompressed page buffers it holds. With the default HeapByteBufferAllocator this is masked by GC, but with a direct ByteBufferAllocator it becomes a hard native memory leak that grows with every row group read. InternalParquetRecordReader works around this by manually closing the PageReadStore before each read and in its own close(), but any direct caller of ParquetFileReader that does not replicate this pattern will leak buffers.
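The leak pattern can be illustrated with a self-contained sketch (all names here are hypothetical stand-ins, not the real Parquet classes): a field holding a closeable resource is overwritten on each read without closing the previous instance, so buffers that must be released explicitly are simply abandoned.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for a PageReadStore whose buffers must be explicitly released.
class PageStore implements AutoCloseable {
    static final AtomicInteger OUTSTANDING = new AtomicInteger();
    private boolean closed;
    PageStore() { OUTSTANDING.incrementAndGet(); }
    @Override public void close() {
        if (!closed) { closed = true; OUTSTANDING.decrementAndGet(); }
    }
}

// Stand-in for the pre-fix reader: the previous row group's reference is
// silently overwritten without releasing its buffers.
class LeakyReader {
    private PageStore currentRowGroup;
    PageStore readNextRowGroup() {
        currentRowGroup = new PageStore(); // previous instance never closed
        return currentRowGroup;
    }
}

public class LeakDemo {
    public static void main(String[] args) {
        LeakyReader reader = new LeakyReader();
        for (int i = 0; i < 5; i++) {
            reader.readNextRowGroup();
        }
        // All five stores are still holding their buffers.
        System.out.println("outstanding=" + PageStore.OUTSTANDING.get());
    }
}
```

With a heap allocator the GC eventually reclaims the abandoned stores; with a direct allocator nothing ever does, which is the hard leak described above.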

What changes are included in this PR?

A private closeCurrentRowGroup() method is added to ParquetFileReader that null-safely closes and nulls the currentRowGroup field. It is called in readNextRowGroup() and readNextFilteredRowGroup() before assigning the new row group, and currentRowGroup is included in the AutoCloseables.uncheckedClose() chain in close(). This brings the buffer lifecycle management into ParquetFileReader itself so all callers benefit automatically.
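A minimal sketch of the lifecycle the PR describes (simplified, hypothetical names; not the actual Parquet code): a private null-safe helper closes and nulls the previous row group before the field is reassigned, and the reader's own close() releases whatever is still held.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for a row group whose buffers must be explicitly released.
class RowGroup implements AutoCloseable {
    static final AtomicInteger OUTSTANDING = new AtomicInteger();
    private boolean closed;
    RowGroup() { OUTSTANDING.incrementAndGet(); }
    @Override public void close() {
        if (!closed) { closed = true; OUTSTANDING.decrementAndGet(); }
    }
}

// Simplified model of the fixed reader: close-before-replace plus
// close-on-reader-close.
class FixedReader implements AutoCloseable {
    private RowGroup currentRowGroup;

    // Null-safely closes and nulls the current row group, as the PR describes.
    private void closeCurrentRowGroup() {
        if (currentRowGroup != null) {
            currentRowGroup.close();
            currentRowGroup = null;
        }
    }

    RowGroup readNextRowGroup() {
        closeCurrentRowGroup();           // release the previous group's buffers
        currentRowGroup = new RowGroup();
        return currentRowGroup;
    }

    @Override public void close() {
        closeCurrentRowGroup();           // the last row group is also released
    }
}

public class FixDemo {
    public static void main(String[] args) {
        try (FixedReader reader = new FixedReader()) {
            for (int i = 0; i < 5; i++) reader.readNextRowGroup();
        }
        System.out.println("outstanding=" + RowGroup.OUTSTANDING.get()); // 0
    }
}
```

At most one row group is ever live at a time, and none survive the reader's close().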

Are these changes tested?

The existing test suites in parquet-hadoop continue to pass. Additional tests were added to verify that PageReadStore buffers are properly released.

Are there any user-facing changes?

No API changes. Callers that already close the PageReadStore themselves (like InternalParquetRecordReader) will see a harmless double-close since ColumnChunkPageReadStore.close() is idempotent via ByteBufferReleaser. Callers that did not close the PageReadStore will now have their buffers released automatically, reducing memory usage.
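The double-close safety relies on an idempotent close. The usual pattern (sketched here with hypothetical names; this is not the real ByteBufferReleaser) is a guard flag so that the second and later close() calls are no-ops:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of an idempotent releaser: buffers are freed exactly once,
// no matter how many times close() is called.
class Releaser implements AutoCloseable {
    static final AtomicInteger RELEASES = new AtomicInteger();
    private boolean released;
    @Override public void close() {
        if (released) return;        // guard makes double-close harmless
        released = true;
        RELEASES.incrementAndGet();  // stands in for freeing the buffers once
    }
}

public class DoubleCloseDemo {
    public static void main(String[] args) {
        Releaser r = new Releaser();
        r.close(); // caller closes first (like InternalParquetRecordReader)
        r.close(); // reader closes again via its own cleanup
        System.out.println("releases=" + Releaser.RELEASES.get()); // 1
    }
}
```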

Closes #3487

Why

ParquetFileReader never closes the ColumnChunkPageReadStore it returns from readNextRowGroup(). When a subsequent readNextRowGroup() call is made, the previous row group's reference is silently overwritten without releasing its buffers. Likewise, ParquetFileReader.close() does not close the last row group it returned. With the default heap allocator this leak is masked by GC, but with a direct ByteBufferAllocator (e.g., one backed by Arena.ofShared()) the compressed column chunk data and decompressed page buffers become a hard native memory leak that grows with every row group read.
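Why a direct allocator turns the missed close into a hard leak can be seen with a counting allocator, modeled loosely on the allocate/release contract of Parquet's ByteBufferAllocator (the class below is a hypothetical sketch, not the real interface): every allocate must be paired with a release, because the GC does not reclaim the native memory behind direct buffers promptly.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a leak-detecting direct allocator: it counts buffers that were
// allocated but never handed back via release().
class CountingDirectAllocator {
    private final AtomicInteger outstanding = new AtomicInteger();

    ByteBuffer allocate(int size) {
        outstanding.incrementAndGet();
        return ByteBuffer.allocateDirect(size);
    }

    void release(ByteBuffer buffer) {
        outstanding.decrementAndGet();
        // a real allocator would also free or recycle the native memory here
    }

    int outstanding() { return outstanding.get(); }
}

public class AllocatorDemo {
    public static void main(String[] args) {
        CountingDirectAllocator allocator = new CountingDirectAllocator();
        ByteBuffer kept = allocator.allocate(1024);
        allocator.allocate(1024); // never released: this is the leak
        allocator.release(kept);
        System.out.println("leaked buffers: " + allocator.outstanding()); // 1
    }
}
```

Dropping such an allocator into a read loop that never closes the returned PageReadStore shows the outstanding count growing by one per row group.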

What

Add a closeCurrentRowGroup() helper to ParquetFileReader that null-safely closes and nulls the currentRowGroup field. Call it in readNextRowGroup() and readNextFilteredRowGroup() before assigning the new row group, and include currentRowGroup in the AutoCloseables.uncheckedClose() chain in close(). This matches the lifecycle pattern already implemented manually by InternalParquetRecordReader but bakes it into ParquetFileReader itself so all direct callers benefit.


Successfully merging this pull request may close these issues.

ParquetFileReader does not close previous PageReadStore on readNextRowGroup() or on close()
